Comparative Evaluation of Automatic Named Entity Recognition from Machine Translation Output

Authors

  • Bogdan Babych
  • Anthony Hartley
Abstract

We report the results of an experiment on automatic NE recognition from machine translations produced by five different MT systems. The NE annotations are compared with the results obtained from two high-quality human translations. The experiment shows that for recognition of a large class of NEs (person names, locations, dates, etc.) MT output is almost as useful as a human translation. For other types of NEs (organisation names), Precision figures are close to the results for human annotation, although Recall is seriously degraded by the lower quality of the MT output. The success rate of NE recognition does not correlate strongly with human or automatic MT evaluation scores, which suggests that the quality criteria needed for measuring MT usability for dissemination purposes are not pertinent to assimilation tasks such as Information Extraction.

1. Dissemination vs assimilation

Since the 1960s, the 'Holy Grail' of Machine Translation technology has been Fully Automatic High Quality Translation (FAHQT), which aims at creating accurate and fluent texts in a target language suitable for dissemination (i.e. publication) purposes, a goal which has yet to be achieved. However, there have been successful attempts and suggestions to use 'crummy' MT output (Church and Hovy, 1993) for assimilation (i.e. comprehension) tasks: text classification, relevance rating and information extraction (White et al., 2000), and for NLP tasks such as Cross-Language Information Retrieval (Gachot et al., 1998) and Multilingual Question Answering (a new task set up for CLEF 2003).

Multilingual Information Extraction is one such assimilation task and consequently an area where imperfect MT output is potentially useful. On the one hand, MT can extend the reach of existing monolingual IE systems by translating a text before running IE; on the other hand, the results of IE (identified Named Entities, template elements or scenario templates) can be translated into a foreign language after IE processing (Wilks, 1997: 7-8). The first scenario is more demanding for MT, because the performance of an automatic IE system may be influenced by MT quality. This leaves an open question: which aspects of MT quality are important for different IE tasks and may substantially influence the performance of IE?

MT quality is often benchmarked from the viewpoint of human users (White et al., 1994), still focusing on the goal of FAHQT for dissemination. As a result, automatic evaluation scores such as BLEU (Papineni et al., 2002) are validated according to how well they correlate with human intuitive judgements of translation quality. Using edit distances between MT output and a human reference translation to evaluate MT (Akiba et al., 2001) likewise makes the implicit assumption that MT should be suitable for dissemination purposes. However, MT has created its own demand precisely in the area where otherwise there would be no translation at all. Where it is used primarily for assimilation, the evaluation of NLP performance on MT output may give a better indication of its usefulness than dissemination criteria. Therefore there is a need for: (1) systematically benchmarking NLP technologies, such as IE (and its sub-tasks, e.g., NE recognition), on MT output; (2) developing and calibrating automatic MT evaluation scores for these primary uses of 'crummy' MT; (3) assessing quantitatively the extent to which particular human and automatic MT evaluation scores predict the performance of automatic systems on different NLP tasks.
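Point (3) amounts to computing a correlation statistic over per-system scores. The following minimal sketch (Python, with hypothetical placeholder numbers rather than figures from this paper) shows how the Pearson correlation between an automatic MT evaluation score and per-system NE recognition F-scores could be computed:

    # Minimal sketch for point (3). The scores below are hypothetical
    # placeholders, not results reported in this paper.

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equal-length score lists."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    # Hypothetical per-system scores for five MT systems.
    bleu_scores = [0.28, 0.31, 0.35, 0.40, 0.42]  # automatic MT evaluation
    ne_f_scores = [0.78, 0.74, 0.80, 0.77, 0.82]  # NE recognition F-scores

    print(f"r = {pearson(bleu_scores, ne_f_scores):.2f}")

A value of r close to zero would support the claim that dissemination-oriented scores are poor predictors of NE recognition performance.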
2. Set-up of the experiment

We addressed some of the above issues by conducting a comparative evaluation of the performance of the ANNIE NE recognition module of Sheffield's GATE IE system (Gaizauskas et al., 1995; Cunningham et al., 1996, 2002). We used the DARPA-94 corpus of French-English MT and human translations (White et al., 1994). The MT systems were Candide, Globalink, Metal and Systran (participants in DARPA), plus Reverso. Specifically, we focused on whether there is a significant divergence between NE recognition performance and the results of human and automatic evaluation of the MT systems. This indicates to what extent MT quality criteria may differ for human use and for the needs of NLP systems.

In the first stage, NEs were annotated in translations of 100 news reports (each text is about 350 words) produced by each MT system. NEs were also annotated in the two independent human translations of the same 100 texts: the Reference and the Expert translations. Comparative evaluation of this NE annotation differs from the standard evaluation procedure for NE recognition in two respects. The first difference is that in our experiment there is no gold-standard NE annotation for any of the human translations or MT outputs. The second difference is that the annotated text is no longer constant.

2.1 Absence of a gold standard

Since all seven sets of texts are different, it would be too expensive to produce a gold-standard annotation for each of them. However, all these texts have the same origin: all are translations of the same collection of French source texts, so a great overlap between extracted NEs can be expected, namely for those typical cases where French NEs have a standard translation into English. While we expect that most types of NEs stay the same across different translations, we also have to account for possible variation. Two main things can go wrong when NEs are extracted from MT output (which is generally regarded as being of lower quality than a human translation):

– NE recognition often relies on certain contextual conditions being met, so if a lexical or morphosyntactic context is distorted in MT output, NEs will not be extracted, resulting in NE 'undergeneration'; likewise, a distorted context may give rise to false NEs, leading to NE 'overgeneration'.
– If NEs are wrongly translated, even where the context meets the requirements of the NE recognition system, they are of no use in other NLP tasks.

The goal of our comparative evaluation is to estimate to what extent the output of different MT systems and the alternative human translation are 'robust' against these two pitfalls, i.e., to what extent they may be useful for IE purposes. This means that we are less interested in absolute performance figures for the NE recognition system than in the comparison between its runs on the output of different MT systems. Furthermore, the accuracy scores for leading NE recognition systems are relatively high: the default settings of the ANNIE NE modules produce 80-90% Precision and Recall on news texts originally written in English (Cunningham et al., 2002). We assume that NE recognition performance is similar for comparable texts, i.e. human translations of news reports into English. Therefore, for our purposes it is possible to use the NE annotation in one of the human translations as a reference, which serves as a 'silver standard' for benchmarking NE recognition performance on 'low-quality' MT texts.
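In practice, the comparison just described reduces to set overlap between the NEs extracted from an MT output and the NEs extracted from the silver-standard human translation. The sketch below (Python) illustrates this; treating NEs as (type, lower-cased string) pairs is our simplifying assumption, not necessarily the exact matching criterion used in the study:

    # Illustrative scoring of one MT output's NE annotation against the
    # 'silver standard' annotation of a human translation. Comparing NEs
    # as (type, lower-cased string) pairs is an assumption for this sketch.

    def prf(candidate, reference):
        """Precision, Recall and F-score of candidate NEs vs reference NEs."""
        cand, ref = set(candidate), set(reference)
        matched = cand & ref
        p = len(matched) / len(cand) if cand else 0.0
        r = len(matched) / len(ref) if ref else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    silver = {("Person", "jacques chirac"), ("Location", "paris"),
              ("Organization", "state department"), ("Date", "june 1994")}
    mt_out = {("Person", "jacques chirac"), ("Location", "paris"),
              ("Date", "june 1994")}  # the organisation NE was missed

    p, r, f = prf(mt_out, silver)
    print(f"P={p:.2f} R={r:.2f} F={f:.2f}")

Undergeneration lowers Recall (a reference NE with no match), while overgeneration lowers Precision (a candidate NE with no match).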
The baseline for such comparisons is the NE annotation in the other human translation: it indicates what difference in accuracy may be expected when an alternative high-quality translation is used. This allows us to: (1) estimate the relative performance of the NE recognition system on texts of variable quality; (2) compare these relative figures with human and automatic MT evaluation scores; (3) answer the question whether the usefulness of MT for IE should be characterised by criteria other than Adequacy and Fluency, or whether these correctly predict the potential performance of NE recognition.

2.2 Legitimate variation in translation

In our research the annotated text is no longer constant; it is instead the NE recognition system that is held constant. This requires a different interpretation of the figures for Precision, Recall and F-score: strictly speaking, they characterise only differences rather than degrees of perfection. Annotation mismatches do not necessarily mean deterioration; they may also be due to improved performance of NE recognition on the test file, or to the choice of a legitimate alternative translation. For example, we expect that NEs normally have a standard translation and will not vary across different human translations; the quality of MT systems therefore depends on how well this standard is followed. The only exceptions to this rule should be less well-known organisations which do not have an established translation. Surprisingly, however, some degree of legitimate variation was found in the human translations even for well-known institutions:

ORI: De son côté, le département d'Etat américain, dans un communiqué, a déclaré: 'Nous ne comprenons pas la décision' de Paris.
HT-Expert: For its part, the American Department of State said in a communique that 'We do not understand the decision' made by Paris.
HT-Reference: For its part, the American State Department stated in a press release: We do not understand the decision of Paris.
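Under exact string matching, a legitimate variant such as 'Department of State' vs 'State Department' would be counted as both a Recall and a Precision error. One way to tolerate such word-order variation, offered purely as an illustrative assumption rather than the procedure used in this study, is to compare NEs as sets of content words:

    # Hedged sketch: treating NE strings as sets of content words so that
    # word-order variants match. This relaxation is our illustration, not
    # the matching rule used in the paper.

    STOPWORDS = {"of", "the"}

    def content_words(ne):
        return frozenset(w for w in ne.lower().split() if w not in STOPWORDS)

    def same_entity(ne_a, ne_b):
        return content_words(ne_a) == content_words(ne_b)

    assert same_entity("American Department of State", "American State Department")
    assert not same_entity("State Department", "Defense Department")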


Similar articles

Selecting Translation Strategies in MT using Automatic Named Entity Recognition

We report on the results of an experiment aimed at enabling a machine translation system to select the appropriate strategy for dealing with words and phrases which have different translations depending on whether they are used as proper names or common nouns in the source text. We used the ANNIE named entity recognition system to identify named entities in the source text and pass them to MT s...


Improving Machine Translation Quality with Automatic Named Entity Recognition

Named entities create serious problems for state-of-the-art commercial machine translation (MT) systems and often cause translation failures beyond the local context, affecting both the overall morphosyntactic well-formedness of sentences and word sense disambiguation in the source text. We report on the results of an experiment in which MT input was processed using output from the named entity...


tRuEcasIng

Truecasing is the process of restoring case information to badly-cased or non-cased text. This paper explores truecasing issues and proposes a statistical, language-modeling-based truecaser which achieves an accuracy of ∼98% on news articles. Task-based evaluation shows a 26% F-measure improvement in named entity recognition when using truecasing. In the context of automatic content extraction, ...


Improving Persian Named Entity Recognition Using the Ezafe Marker

Named entity recognition is a process in which people's names, names of places (cities, countries, seas, etc.), organizations (public and private companies, international institutions, etc.), dates, currencies and percentages in a text are identified. Named entity recognition plays an important role in many NLP tasks such as semantic role labeling, question answering, summarization, machine ...


Recognising Person Names by Injecting Candidate Name Words into Conditional Random Fields for Arabic

Named Entity Recognition and Extraction are very important tasks for discovering proper names, including persons, locations, dates, and times, inside electronic textual resources. An accurate named entity recognition system is an essential utility to resolve fundamental problems in question answering systems, summary extraction, information retrieval and extraction, machine translation, video interpr...


A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well in the Persian language if they are trained with good features. To get good performanc...



Publication date: 2004